Project Title: Forecasting Herbicide Sales Using Historical Data Analysis¶

Introduction¶

This project aims to analyze historical data on herbicide production and market values from 2000 to 2023 and forecast sales for the next five years (2024-2028). Utilizing Python, we will process and analyze the dataset, identify trends, and implement a suitable forecasting method. The analysis will consider various factors such as area harvested, yield, and consumer price index changes, to provide accurate predictions for future market value of herbicides.

Key Steps in the Project¶

  1. Data Loading and Preprocessing: Import the dataset into Python and perform necessary data cleaning and preprocessing steps to ensure the data is ready for analysis.

  2. Exploratory Data Analysis (EDA): Conduct a thorough analysis of the data to identify trends, patterns, and correlations that may be useful for forecasting.

  3. Model Selection and Implementation: Choose an appropriate forecasting model based on the insights gained from the EDA and implement it using Python.

  4. Sales Forecast Generation: Generate sales forecasts for the years 2024-2028 using the selected model.

  5. Model Evaluation and Presentation: Evaluate the accuracy of the forecasted sales data, ensuring clarity and organization of the code, and provide proper documentation and comments explaining the steps taken.

By following these steps, we aim to deliver a robust and accurate forecast of herbicide sales, supported by clear and well-documented code.

In [1]:
#Importing necessary libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

The libraries used in this code are as follows:

pandas: Provides data structures and data analysis tools for handling and manipulating structured data.

numpy: Supports large, multi-dimensional arrays and matrices, along with mathematical functions to operate on them.

matplotlib.pyplot: A plotting library used for creating static, animated, and interactive visualizations.

seaborn: A data visualization library based on matplotlib for drawing attractive and informative statistical graphics.

sklearn.model_selection.train_test_split: Splits datasets into training and testing sets for model evaluation.

sklearn.metrics.mean_squared_error: Evaluates regression model performance by calculating the mean squared error.

statsmodels.tsa.holtwinters.ExponentialSmoothing: Implements the Holt-Winters Exponential Smoothing method for time series forecasting.

warnings: Manages the display of warning messages to ensure cleaner output during code execution.

variance_inflation_factor from statsmodels.stats.outliers_influence: Calculates the Variance Inflation Factor (VIF) to detect multicollinearity in a set of predictor variables.

train_test_split from sklearn.model_selection: Splits arrays or matrices into random train and test subsets for model evaluation.

StandardScaler from sklearn.preprocessing: Standardizes features by removing the mean and scaling to unit variance.

LinearRegression from sklearn.linear_model: Implements ordinary least squares linear regression.

GradientBoostingRegressor from sklearn.ensemble: Implements gradient boosting for regression problems.

SVR from sklearn.svm: Supports vector regression for predicting continuous outputs.

xgb (XGBoost): Provides an optimized gradient boosting library designed for speed and performance.

MLPRegressor from sklearn.neural_network: Implements a multi-layer perceptron regressor for modeling complex relationships.

mean_squared_error, r2_score, mean_absolute_error from sklearn.metrics: Provides metrics to evaluate the performance of regression models.

plot_acf from statsmodels.graphics.tsaplots: Plots the autocorrelation function (ACF) for time series analysis.

boxcox from scipy.stats: Applies the Box-Cox transformation to make data more normally distributed.

Holt from statsmodels.tsa.holtwinters: Implements Holt’s linear trend method for forecasting.

go from plotly.graph_objects: Provides a high-level interface for creating interactive visualizations in Plotly.

This data dictionary provides a description of each column in the dataset and their corresponding attributes.

Column Name Description
Year The year of the recorded data
Market_Value Market value in USD
Area_Harvested Area harvested in thousands of hectares (000 Ha)
Yield Yield in tons per hectare (Tons/Ha)
CPI Consumer Price Index, Year-on-Year percent change

Load the Data¶

In [2]:
# Load the data from the Excel files
market_value_data = pd.read_excel('Agrochemical Market Value Data.xlsx')
variables_data = pd.read_excel('Variables Data.xlsx')

# Save the data to CSV files
market_value_data.to_csv('Agrochemical_Market_Value_Data.csv', index=False)
variables_data.to_csv('Variables_Data.csv', index=False)

print("Data has been successfully converted to CSV format.")
Data has been successfully converted to CSV format.

We converted the data from Excel to CSV format to simplify data handling, improve performance, ensure compatibility with various tools, and facilitate easier sharing and inspection. CSV files are lightweight, faster to read/write, and widely supported, making them ideal for data analysis and forecasting tasks.

In [3]:
# Load the data from the CSV files
market_value_data = pd.read_csv('Agrochemical_Market_Value_Data.csv')
variables_data = pd.read_csv('Variables_Data.csv')

# Display the first few rows of each DataFrame to verify the data
print("Market Value Data:")
print(market_value_data.head())

print("\nVariables Data:")
print(variables_data.head())
Market Value Data:
   Year     Product Market Value (USD)
0  2000  Herbicides   103.298833767801
1  2001  Herbicides   114.172395217043
2  2002  Herbicides    90.613012077019
3  2003  Herbicides   108.735614492422
4  2004  Herbicides   117.796915700124

Variables Data:
   Year     Product  Area Harvested (000 Ha)  Yield (Tons/Ha)  \
0  2000  Herbicides                   3100.0              5.5   
1  2001  Herbicides                   2816.0              5.5   
2  2002  Herbicides                   2420.0              6.1   
3  2003  Herbicides                   2450.0              6.3   
4  2004  Herbicides                   2339.0              6.4   

   Consumer Price Index, Year-on-Year Percent Change  
0                                               -0.9  
1                                               -1.1  
2                                               25.9  
3                                               13.4  
4                                                4.4  
In [4]:
# Merge the two DataFrames on the 'Year' and 'Product' columns
merged_data = pd.merge(market_value_data, variables_data, on=['Year', 'Product'])

# Display the first few rows of the merged DataFrame to verify the merge
print("\nMerged Data:")
print(merged_data.head())
Merged Data:
   Year     Product Market Value (USD)  Area Harvested (000 Ha)  \
0  2000  Herbicides   103.298833767801                   3100.0   
1  2001  Herbicides   114.172395217043                   2816.0   
2  2002  Herbicides    90.613012077019                   2420.0   
3  2003  Herbicides   108.735614492422                   2450.0   
4  2004  Herbicides   117.796915700124                   2339.0   

   Yield (Tons/Ha)  Consumer Price Index, Year-on-Year Percent Change  
0              5.5                                               -0.9  
1              5.5                                               -1.1  
2              6.1                                               25.9  
3              6.3                                               13.4  
4              6.4                                                4.4  

Data Prepocessing¶

Checking and updating datatype: Ensures data integrity and compatibility by converting columns to appropriate types, thereby enabling accurate data analysis, efficient storage, and proper execution of operations and algorithms.

In [5]:
# Check data types in the Market Value Data
print("\nData Types in Merged Data:")
print(merged_data.dtypes)
Data Types in Merged Data:
Year                                                   int64
Product                                               object
Market Value (USD)                                    object
Area Harvested (000 Ha)                              float64
Yield (Tons/Ha)                                      float64
Consumer Price Index, Year-on-Year Percent Change    float64
dtype: object
In [6]:
# Drop the 'Product' column as it is redundant
merged_data.drop(columns=['Product'], inplace=True)


# Convert 'Market Value (USD)' to float
merged_data['Market Value (USD)'] = merged_data['Market Value (USD)'].replace('?', 0).astype(float)

# Display the first few rows of the DataFrame after type conversion
print("\nData after type conversion:")
print(merged_data.head())

# Check data types to confirm the conversion
print("\nData Types in Merged Data after conversion:")
print(merged_data.dtypes)
Data after type conversion:
   Year  Market Value (USD)  Area Harvested (000 Ha)  Yield (Tons/Ha)  \
0  2000          103.298834                   3100.0              5.5   
1  2001          114.172395                   2816.0              5.5   
2  2002           90.613012                   2420.0              6.1   
3  2003          108.735614                   2450.0              6.3   
4  2004          117.796916                   2339.0              6.4   

   Consumer Price Index, Year-on-Year Percent Change  
0                                               -0.9  
1                                               -1.1  
2                                               25.9  
3                                               13.4  
4                                                4.4  

Data Types in Merged Data after conversion:
Year                                                   int64
Market Value (USD)                                   float64
Area Harvested (000 Ha)                              float64
Yield (Tons/Ha)                                      float64
Consumer Price Index, Year-on-Year Percent Change    float64
dtype: object

We deleted the redundant column 'Product' because it contains the same value ('Herbicides') for all rows, providing no additional information for the forecasting analysis.

This code will rename the columns as follows:

'Year' -> 'Year' 'Market Value (USD)' -> 'Market_Value' 'Area Harvested (000 Ha)' -> 'Area_Harvested' 'Yield (Tons/Ha)' -> 'Yield' 'Consumer Price Index, Year-on-Year Percent Change' -> 'CPI' These new column names are simpler and devoid of extra symbols or characters, making them more suitable for faster data processing.

In [7]:
# Renaming columns
merged_data.columns = ['Year', 'Market_Value', 'Area_Harvested', 'Yield', 'CPI']

print(merged_data)
    Year  Market_Value  Area_Harvested  Yield    CPI
0   2000    103.298834     3100.000000    5.5   -0.9
1   2001    114.172395     2816.000000    5.5   -1.1
2   2002     90.613012     2420.000000    6.1   25.9
3   2003    108.735614     2450.000000    6.3   13.4
4   2004    117.796916     2339.000000    6.4    4.4
5   2005    132.294998     2783.000000    7.4    9.6
6   2006     94.237533     2440.000000    6.5   10.9
7   2007    117.796916     2800.000000    8.0    8.8
8   2008    168.540202     3412.000000    6.5    8.6
9   2009    126.858217     2500.000000    6.2    6.3
10  2010    154.042121     3000.000000    8.3   10.5
11  2011    199.348627     3750.000000    6.7    9.8
12  2012    302.647460     3600.000000    5.8   10.0
13  2013    537.428899     4000.000000    6.8   10.6
14  2014    540.053552     3400.000000    7.6   21.4
15  2015    457.233259     3500.000000    8.5   16.4
16  2016    498.371566     3700.000000    8.0   37.5
17  2017    509.245128     4900.000000    8.4   25.4
18  2018    525.555470     5200.000000    6.2   34.2
19  2019    572.674236     6100.000000    8.4   52.8
20  2020    612.169011     6300.000000    8.1   40.5
21  2021    625.000000     6550.000000    7.9   47.1
22  2022    900.000000     7100.000000    7.0   73.1
23  2023    682.838834     6750.000000    5.3  134.0
24  2024      0.000000     6800.000000    8.1  272.2
25  2025      0.000000     7200.000000    7.6   70.9
26  2026      0.000000     7132.197788    7.7   21.2
27  2027      0.000000     7221.251348    7.7   11.8
28  2028      0.000000     7244.102280    7.8    5.0

Exploratory Data Analysis¶

This code will filter the merged_data dataframe to only include rows where the Year is less than or equal to 2023, and then calculate the summary statistics for the filtered data. The summary statistics will include mean, variance, median, range, and other standard measures.

We include data up to the year 2023 in our analysis to study historical trends and inform predictive modeling for subsequent years using the specified variables.

In [8]:
# Filter data to only include years up to 2023
filtered_data = merged_data[merged_data['Year'] <= 2023]

# Calculate summary statistics
summary_stats = filtered_data.describe().transpose()

# Calculate additional statistics
summary_stats['variance'] = filtered_data.var()
summary_stats['median'] = filtered_data.median()
summary_stats['range'] = filtered_data.max() - filtered_data.min()

# Display the summary statistics
print(summary_stats)
                count         mean          std          min          25%  \
Year             24.0  2011.500000     7.071068  2000.000000  2005.750000   
Market_Value     24.0   345.456367   244.623899    90.613012   117.796916   
Area_Harvested   24.0  3954.583333  1550.849779  2339.000000  2795.750000   
Yield            24.0     6.975000     1.038079     5.300000     6.200000   
CPI              24.0    25.383333    29.578939    -1.100000     9.400000   

                        50%          75%     max      variance       median  \
Year            2011.500000  2017.250000  2023.0  5.000000e+01  2011.500000   
Market_Value     250.998043   538.085062   900.0  5.984085e+04   250.998043   
Area_Harvested  3456.000000  4975.000000  7100.0  2.405135e+06  3456.000000   
Yield              6.750000     8.000000     8.5  1.077609e+00     6.750000   
CPI               12.150000    35.025000   134.0  8.749136e+02    12.150000   

                      range  
Year              23.000000  
Market_Value     809.386988  
Area_Harvested  4761.000000  
Yield              3.200000  
CPI              135.100000  
In [9]:
# Function to find the year and value corresponding to the min and max values
def find_min_max_years_values(df, column):
    min_row = df.loc[df[column] == df[column].min()]
    max_row = df.loc[df[column] == df[column].max()]
    min_year = min_row['Year'].values[0]
    max_year = max_row['Year'].values[0]
    min_value = min_row[column].values[0]
    max_value = max_row[column].values[0]
    return min_year, min_value, max_year, max_value

# Dictionary to hold the results for specific columns
specific_results = {}

# Columns to analyze
columns_to_analyze = ['Area_Harvested', 'Market_Value', 'Yield']

# Find years and values for each specified column
for column in columns_to_analyze:
    min_year, min_value, max_year, max_value = find_min_max_years_values(filtered_data, column)
    specific_results[column] = {
        'Min Year': min_year,
        'Min Value': min_value,
        'Max Year': max_year,
        'Max Value': max_value
    }

# Display the results
for column, details in specific_results.items():
    print(f"{column}: {details}")
Area_Harvested: {'Min Year': 2004, 'Min Value': 2339.0, 'Max Year': 2022, 'Max Value': 7100.0}
Market_Value: {'Min Year': 2002, 'Min Value': 90.613012077019, 'Max Year': 2022, 'Max Value': 900.0}
Yield: {'Min Year': 2023, 'Min Value': 5.3, 'Max Year': 2015, 'Max Value': 8.5}

The dataset covers the period from 2000 to 2023, with a focus on forecasting market values for the subsequent five years. The analysis considers historical data until 2023 to make predictions.

Summary Statistics:¶

  • Market Value:

    • Mean: USD 345.46
    • Standard Deviation: USD 244.62
    • Minimum: USD 90.61 (Year: 2002)
    • Maximum: USD 900 (Year: 2022)
    • Median: USD 250.99
    • Variance: USD 59840.85
    • Range: 809.39
  • Area Harvested:

    • Mean: 3954.58 thousand hectares
    • Standard Deviation: 1550.85 thousand hectares
    • Minimum: 2339.00 (Year: 2004)
    • Maximum: 7100.00 (Year: 2022)
    • Median: 3456.00 thousand hectares
    • Variance: 2,405,135.00
    • Range: 4761.00 thousand hectares
  • Yield:

    • Mean: 6.98 tons/ha
    • Standard Deviation: 1.04 tons/ha
    • Minimum: 5.30 tons/ha (Year: 2023)
    • Maximum: 8.50 tons/ha (Year: 2015)
    • Median: 6.75 tons/ha
    • Variance: 1.08
    • Range: 3.20 tons/ha
  • CPI Year-on-Year Percent Change:

    • Mean: 25.38%
    • Standard Deviation: 29.58%
    • Minimum: -1.10%
    • Maximum: 134.00%
    • Median: 12.15%
    • Variance: 874.91
    • Range: 135.10

Key Observations:¶

  • The Market Value peaked in 2022 at USD 900.00 and hit its lowest point in 2002 at USD 90.61.
  • Area Harvested reached its maximum in 2022 (7100.00 thousand hectares) and its minimum in 2004 (2339.00 thousand hectares).
  • The Yield was highest in 2015 at 8.50 tons/ha and lowest in 2023 at 5.30 tons/ha.
  • The CPI Year-on-Year Percent Change shows substantial variability, with a peak value of 134.00% and a low of -1.10%.

This dataset reveals significant variability in market values and harvested areas, with notable peaks observed in 2022. Yield data shows more consistency, with the lowest yield noted in 2023. These trends and statistics are essential for forecasting future market values and understanding agricultural trends.

In [10]:
# List of features to create box plots for
features = ['Market_Value', 'Area_Harvested', 'Yield', 'CPI']

# Create box plots
plt.figure(figsize=(14, 10))

for i, feature in enumerate(features, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=filtered_data[feature])
    plt.title(f'Box Plot of {feature}')
    plt.xlabel('')

plt.tight_layout()
plt.show()

Box Plot Analysis¶

  1. Market Value: The box plot shows a relatively wide distribution of values with no significant outliers.
  2. Area Harvested: Similar to market value, the area harvested shows a broad range with no outliers, indicating consistent data distribution.
  3. Yield: The yield data is more tightly clustered, suggesting less variability, with no apparent outliers.
  4. CPI YoY Percent Change: There is a noticeable outlier in the CPI YoY Percent Change, indicating a significant deviation from the typical range of values.
In [11]:
# List of features to plot against Year
features = ['Market_Value', 'Area_Harvested', 'Yield', 'CPI']

# Plot each feature against Year
plt.figure(figsize=(12, 8))

for i, feature in enumerate(features, 1):
    plt.subplot(2, 2, i)
    plt.plot(filtered_data['Year'], filtered_data[feature], marker='o')
    plt.title(f'Year vs {feature}')
    plt.xlabel('Year')
    plt.ylabel(feature)
    plt.grid(True)

plt.tight_layout()
plt.show()

Visualization Description and Insights¶

We have created separate line charts for each of the four variables against the year: Area Harvested, Yield, Consumer Price Index (CPI) Year-on-Year Percent Change, and Market Value.

Area Harvested (000 Ha):¶

  • Trend: There is a general increasing trend in the area harvested from 2000 to 2023.
  • Insights: The area harvested starts at 3,100 thousand hectares in 2000, with notable increases over the years, peaking at 7,100 thousand hectares in 2022. This suggests expanding agricultural activities over the years.

Yield (Tons/Ha):¶

  • Trend: Yield shows fluctuations over the period with both increases and decreases.
  • Insights: Yield varies, reaching a high of 8.5 tons/ha in 2015 and a low of 5.3 tons/ha in 2023. The variability indicates changing agricultural productivity, potentially due to varying agricultural practices, climatic conditions, or technological advancements.

Consumer Price Index (CPI) Year-on-Year Percent Change:¶

  • Trend: CPI shows significant variability with several peaks and troughs.
  • Insights: The CPI percent change shows extreme fluctuations, with a high of 134% in 2023. This volatility reflects the economic instability and varying inflation rates affecting the cost of herbicides.

Market Value (USD):¶

  • Trend: The market value of herbicides shows a significant upward trend, particularly from 2011 onwards.
  • Seasonality: There is no obvious sesonality in the data.
  • Insights: Starting at USD 103 in 2000, the market value peaks at USD 900 in 2022 before slightly declining to USD 683 in 2023. This sharp rise indicates increasing demand or pricing power of herbicides in the market.

Summary¶

The visualizations indicate a general growth in agricultural activities and market values for herbicides, despite notable variability in yield and CPI changes. The significant peak in market value in 2022 suggests a potential anomaly or an impactful market event. These insights are crucial for understanding historical trends and informing future market predictions.

In [12]:
# Calculate the correlation matrix
correlation_matrix = filtered_data[['Market_Value', 'Area_Harvested', 'Yield', 'CPI']].corr()

# Print correlation values
print("Correlation Matrix:")
print(correlation_matrix)

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()
Correlation Matrix:
                Market_Value  Area_Harvested     Yield       CPI
Market_Value        1.000000        0.898081  0.331799  0.732717
Area_Harvested      0.898081        1.000000  0.227519  0.797040
Yield               0.331799        0.227519  1.000000 -0.009147
CPI                 0.732717        0.797040 -0.009147  1.000000

The correlation matrix provides insights into the relationships between different variables in the dataset up to the year 2023. Here is a summary of the key findings:

Market Value:

  • Area Harvested: The Market Value has a strong positive correlation with Area Harvested (0.898), indicating that as the area harvested increases, the market value tends to increase significantly.
  • Yield: The correlation between Market Value and Yield is moderate (0.332), suggesting a weaker but positive relationship.
  • CPI: There is a strong positive correlation (0.733) between Market Value and CPI, implying that higher inflation rates are associated with higher market values. We retain the outlier detected earlier due to its high correlation with Market Value and limited data points.

Insights:

  • The Market Value of herbicides is strongly influenced by both the Area Harvested and the CPI. This suggests that market value is likely driven by both production scale and economic conditions.
  • The weak correlation between Yield and other variables, especially CPI, indicates that yield variations are relatively independent of economic inflation and harvested area.
  • The strong inter-correlation between Market Value, Area Harvested, and CPI highlights the interconnected nature of agricultural production and economic factors.
  • Multicollinearity, indicated by high correlation between independent variables, can impair regression model performance by obscuring individual variable effects on the dependent variable. The correlation matrix shows significant correlations, particularly between Market_Value and Area_Harvested (0.898081), and between Area_Harvested and CPI (0.797040). To quantify multicollinearity, we calculate the Variance Inflation Factor (VIF), with values exceeding 10 suggesting high multicollinearity.
In [13]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Calculate VIF
X = filtered_data[['Area_Harvested', 'Yield', 'CPI']]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

print(vif_data)
          feature        VIF
0  Area_Harvested  25.005225
1           Yield  13.495178
2             CPI   5.415824

The VIF values indicate significant multicollinearity among the variables:

  • Area_Harvested (VIF: 25.005225): Very high multicollinearity.
  • Yield (VIF: 13.495178): High multicollinearity.
  • CPI (VIF: 5.415824): Moderate multicollinearity.
  • Variable elimination:Remove Area_Harvested (VIF: 25.005225) due to highest multicollinearity.
  • VIF recalculation:Recompute VIF for remaining variables post-elimination.
In [14]:
# Remove the variable with the highest VIF (Area_Harvested)
X = filtered_data[['Yield', 'CPI']]

# Recalculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

print(vif_data)
  feature       VIF
0   Yield  1.736739
1     CPI  1.736739
In [15]:
# Remove the variable with the highest VIF (Area_Harvested)
X_df = filtered_data[['Yield', 'Area_Harvested']]

# Recalculate VIF
vif_data = pd.DataFrame()
vif_data["feature"] = X_df.columns
vif_data["VIF"] = [variance_inflation_factor(X_df.values, i) for i in range(len(X_df.columns))]

print(vif_data)
          feature       VIF
0           Yield  8.018639
1  Area_Harvested  8.018639

We replace outliers with the mean of neighboring values to ensure data consistency and accuracy, thereby mitigating the impact of anomalous values on analysis and model performance.

We create new columns Market_Value_New and CIP_New to retain our original data and keep a data ready for analysis.

In [16]:
# Function to replace outliers with mean of neighboring values
def replace_outliers(df, year_col, value_col, outlier_years):
    new_col = f'{value_col}_New'
    df[new_col] = df[value_col]  # Create the new column
    for year in outlier_years:
        if year in df[year_col].values:
            idx = df.index[df[year_col] == year][0]
            if idx > 0 and idx < len(df) - 1:
                # Calculate mean of neighboring values
                mean_value = (df.at[idx - 1, value_col] + df.at[idx + 1, value_col]) / 2
                df.at[idx, new_col] = mean_value
            elif idx == 0:
                # If the outlier is at the first row, only take the next value
                df.at[idx, new_col] = df.at[idx + 1, value_col]
            elif idx == len(df) - 1:
                # If the outlier is at the last row, only take the previous value
                df.at[idx, new_col] = df.at[idx - 1, value_col]
    return df

# Outlier years identified from visualization
outlier_years_market_value = [2013, 2022]
outlier_years_cpi = [2002, 2023]

# Replace outliers in 'Market_Value' column
filtered_data = replace_outliers(filtered_data, 'Year', 'Market_Value', outlier_years_market_value)

# Replace outliers in 'CPI' column
filtered_data = replace_outliers(filtered_data, 'Year', 'CPI', outlier_years_cpi)

# Display the modified DataFrame
print(filtered_data)
    Year  Market_Value  Area_Harvested  Yield    CPI  Market_Value_New  \
0   2000    103.298834          3100.0    5.5   -0.9        103.298834   
1   2001    114.172395          2816.0    5.5   -1.1        114.172395   
2   2002     90.613012          2420.0    6.1   25.9         90.613012   
3   2003    108.735614          2450.0    6.3   13.4        108.735614   
4   2004    117.796916          2339.0    6.4    4.4        117.796916   
5   2005    132.294998          2783.0    7.4    9.6        132.294998   
6   2006     94.237533          2440.0    6.5   10.9         94.237533   
7   2007    117.796916          2800.0    8.0    8.8        117.796916   
8   2008    168.540202          3412.0    6.5    8.6        168.540202   
9   2009    126.858217          2500.0    6.2    6.3        126.858217   
10  2010    154.042121          3000.0    8.3   10.5        154.042121   
11  2011    199.348627          3750.0    6.7    9.8        199.348627   
12  2012    302.647460          3600.0    5.8   10.0        302.647460   
13  2013    537.428899          4000.0    6.8   10.6        421.350506   
14  2014    540.053552          3400.0    7.6   21.4        540.053552   
15  2015    457.233259          3500.0    8.5   16.4        457.233259   
16  2016    498.371566          3700.0    8.0   37.5        498.371566   
17  2017    509.245128          4900.0    8.4   25.4        509.245128   
18  2018    525.555470          5200.0    6.2   34.2        525.555470   
19  2019    572.674236          6100.0    8.4   52.8        572.674236   
20  2020    612.169011          6300.0    8.1   40.5        612.169011   
21  2021    625.000000          6550.0    7.9   47.1        625.000000   
22  2022    900.000000          7100.0    7.0   73.1        653.919417   
23  2023    682.838834          6750.0    5.3  134.0        682.838834   

    CPI_New  
0     -0.90  
1     -1.10  
2      6.15  
3     13.40  
4      4.40  
5      9.60  
6     10.90  
7      8.80  
8      8.60  
9      6.30  
10    10.50  
11     9.80  
12    10.00  
13    10.60  
14    21.40  
15    16.40  
16    37.50  
17    25.40  
18    34.20  
19    52.80  
20    40.50  
21    47.10  
22    73.10  
23    73.10  

This step replaces outliers in 'Market_Value' for 2013 and 2022 with the mean of neighboring values to prevent skewing statistical analysis and improve model performance.

Ex-Post Forecasting using Regression models¶

The rationale for using ex post forecasting in this analysis is to validate model accuracy by comparing predicted values against known outcomes, ensuring robustness in temporal prediction.

Regression Analysis Using Different Variable Combinations¶

In the regression analysis, various combinations of variables will be utilized to determine the optimal model. The following combinations have been selected based on Variance Inflation Factor (VIF) scores to address multicollinearity:

  1. Combination 1:

    • Variables: [['Yield', 'CPI']]
    • Rationale: This combination is selected due to the lowest VIF scores among all variables, indicating minimal multicollinearity and potential for a more stable model.
  2. Combination 2:

    • Variables: [['Yield', 'Area_Harvested']]
    • Rationale: Although the VIF scores for this combination are higher than the first, they are still relatively low, suggesting a moderate level of multicollinearity which is acceptable for analysis.

The regression models will be applied to each of these combinations to assess their performance and identify the most suitable model for forecasting.

Combination 1: [['Yield', 'CPI']]¶

The code in following cell performs the following steps for training and evaluating multiple regression models using the filtered_data dataset:

  1. Import Libraries:
    • train_test_split and StandardScaler from sklearn.model_selection and sklearn.preprocessing for data splitting and scaling.
    • Regression models from sklearn.linear_model, sklearn.ensemble, sklearn.svm, xgboost, and sklearn.neural_network.
    • Evaluation metrics from sklearn.metrics.
      • Scaling is done to standardize the features, ensuring they have a mean of 0 and a standard deviation of 1. This is important because different variables (e.g., Yield and CPI_New) can have different scales (units and ranges). Standardizing ensures that all features contribute equally to the model's performance, improving the convergence speed and accuracy of algorithms sensitive to the scale of input data, such as Support Vector Regression, Gradient Boosting, and Neural Networks.
  1. Prepare Features and Target:

    • Selects Yield and CPI_New as features (X) and Market_Value_New as the target variable (y).
  2. Split Data:

    • Splits the data into training and test sets using two methods:
      • Using train_test_split to split the data randomly with an 80/20 split.
      • Using a manual split with 70% of the data for training and 30% for testing.
  3. Standardize Features:

    • Applies StandardScaler to standardize the features, ensuring they have a mean of 0 and a standard deviation of 1 for both training and test sets.
  4. Initialize Models:

    • Defines a dictionary models containing five different regression models: Linear Regression, Gradient Boosting, Support Vector Regression, XGBoost, and Neural Network.
  5. Fit Models and Evaluate:

    • Iterates through each model, fits it to the standardized training data, and makes predictions on the standardized test data.
    • Prints the actual and predicted values, Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R² Score for each model.

This code essentially prepares the data, standardizes it, trains multiple regression models, and evaluates their performance using various metrics.

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare features and target
X = filtered_data[['Yield', 'CPI_New']]
y = filtered_data['Market_Value_New']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [18]:
!pip install xgboost
Requirement already satisfied: xgboost in c:\users\admin\anaconda3\lib\site-packages (2.1.0)
Requirement already satisfied: numpy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.22.4)
Requirement already satisfied: scipy in c:\users\admin\anaconda3\lib\site-packages (from xgboost) (1.7.3)
In [19]:
#import necessary libraries

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
import xgboost as xgb
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error


# Prepare features and target
X = filtered_data[['Yield', 'CPI_New']]
y = filtered_data['Market_Value_New']

# Determine the split point
split_point = int(len(filtered_data) * 0.7)

# Split data into training and test sets
X_train = X[:split_point]
X_test = X[split_point:]
y_train = y[:split_point]
y_test = y[split_point:]

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}

# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}")
    print("Actual Values:   ", y_test.values)
    print("Predicted Values:", y_pred)
    print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test, y_pred)))
    print("Mean Absolute Error:", mean_absolute_error(y_test, y_pred))
    print("R^2 Score:", r2_score(y_test, y_pred))
    print("-" * 30)
Linear Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [ 737.14440826  500.74361833  691.63282666 1027.06802931  793.77795443
  922.54325546 1430.91410567 1447.79874628]
Mean Squared Error: 200192.29554520088
Root Mean Squared Error: 447.4285368024718
Mean Absolute Error: 361.1065376112989
R^2 Score: -46.67996649335153
------------------------------
Gradient Boosting
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [537.57974213 507.57743104 514.39334225 507.57743104 507.23704783
 537.57974213 544.49214159 537.91445722]
Mean Squared Error: 7191.5839187379215
Root Mean Squared Error: 84.80320700738812
Mean Absolute Error: 70.47983484938479
R^2 Score: -0.7128255577754168
------------------------------
Support Vector Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [129.57188841 130.35771788 129.57908887 129.55986204 129.56176223
 129.55987609 129.55986199 129.55986199]
Mean Squared Error: 211519.7189672591
Root Mean Squared Error: 459.91272972952066
Mean Absolute Error: 455.3079678871648
R^2 Score: -49.37782840531405
------------------------------
XGBoost
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [540.0518  540.0518  540.05133 540.0518  540.0518  540.0523  540.0523
 540.05133]
Mean Squared Error: 6216.454684064235
Root Mean Squared Error: 78.84449685339005
Mean Absolute Error: 66.66558995724151
R^2 Score: -0.4805782122454232
------------------------------
Neural Network
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [ 583.53528018  437.81269472  401.78014816  835.55253396  634.56740648
  715.37562494 1025.29641547  898.47570389]
Mean Squared Error: 36233.79460432775
Root Mean Squared Error: 190.3517654352797
Mean Absolute Error: 155.37970691101145
R^2 Score: -7.629833170933656
------------------------------

Model Performance Summary Result 1¶

Model Mean Squared Error (MSE) Root Mean Squared Error (RMSE) Mean Absolute Error (MAE) R² Score
Linear Regression 200192.30 447.43 361.11 -46.6800
Gradient Boosting 7191.58 84.80 70.48 -0.7128
Support Vector Regression 211519.72 459.91 455.31 -49.3778
XGBoost 6216.45 78.84 66.67 -0.4806
Neural Network 36233.79 190.35 155.38 -7.6298

Insights:¶

  • Linear Regression:

    • The model demonstrates significant deviations between actual and predicted values.
    • The high MSE and RMSE indicate substantial errors, and the negative R² Score suggests a poor fit.
  • Gradient Boosting:

    • This model shows a relatively better fit with lower errors and a less negative R² Score compared to Linear Regression, indicating some predictive capability.
  • Support Vector Regression:

    • The model performs poorly, with very high MSE and RMSE, and nearly constant predicted values. The extremely negative R² Score reflects its unsuitability for this dataset.
  • XGBoost:

    • XGBoost shows better performance than Gradient Boosting and Linear Regression, with lower errors and a less negative R² Score, indicating a somewhat better predictive performance.
  • Neural Network:

    • The Neural Network model performs better than most of the tested models but still has a high MSE, RMSE, and a negative R² Score, indicating a poor overall fit.

Summary:¶

  • All models show poor performance with negative R² scores, indicating that none of the models effectively capture the underlying patterns in the data.
  • XGBoost and Gradient Boosting perform relatively better compared to other models but still do not provide an adequate fit.
  • Support Vector Regression shows the highest errors and is not suitable for this dataset.
  • Further analysis and optimization, including feature engineering and model tuning, are needed to improve predictive performance.

Prediction for next 5 Years (Combination 1: [['Yield', 'CPI']])¶

In [20]:
# filtering the data required for prediction
pred_data_1 = merged_data.iloc[24:][['Year','Yield','CPI']]
print(pred_data_1)
    Year  Yield    CPI
24  2024    8.1  272.2
25  2025    7.6   70.9
26  2026    7.7   21.2
27  2027    7.7   11.8
28  2028    7.8    5.0
In [21]:
# Scaling the data
X_pred_scaled_1 = scaler.transform((pred_data_1[['Yield','CPI']]))
In [22]:
# Access the trained XGBoost model
xgb_model = models["XGBoost"]
In [23]:
# Predict new values using the trained model
y_predict_1 = xgb_model.predict(X_pred_scaled_1)
pred_data_1['Market_Value'] = y_predict_1
print(pred_data_1)
    Year  Yield    CPI  Market_Value
24  2024    8.1  272.2    540.051819
25  2025    7.6   70.9    540.052307
26  2026    7.7   21.2    446.820984
27  2027    7.7   11.8    132.953583
28  2028    7.8    5.0    111.102394

Interpretation:¶

Between 2024 and 2028, market value declines significantly despite stable yields. In 2024, a high CPI of 272.2 corresponds with a market value of USD 540.05 million. By 2025, although the CPI drops to 70.9, the market value remains stable at USD 540.05 million. In 2026, with a CPI of 21.2, the market value decreases to USD 446.82 million. The trend continues in 2027 with the market value dropping sharply to USD 132.95 million as the CPI falls to 11.8. By 2028, the market value reaches USD 111.10 million with a low CPI of 5.0. The decreasing CPI, indicating deflation, appears to significantly impact market values.

Combination 2: [['Yield', 'Area_Harvested']]¶

In [24]:
# Prepare features and target
X_1 = filtered_data[['Yield', 'Area_Harvested']]
y_1 = filtered_data['Market_Value_New']

# Determine the split point
split_point = int(len(filtered_data) * 0.7)

# Split data into training and test sets
X_train_1 = X_1[:split_point]
X_test_1 = X_1[split_point:]
y_train_1 = y_1[:split_point]
y_test_1 = y_1[split_point:]

# Standardize the features
scaler = StandardScaler()
X_train_scaled_1 = scaler.fit_transform(X_train_1)
X_test_scaled_1 = scaler.transform(X_test_1)


# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}

# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_1, y_train_1)
    y_pred_1 = model.predict(X_test_scaled_1)
    print(f"{name}")
    print("Actual Values:   ", y_test_1.values)
    print("Predicted Values:", y_pred_1)
    print("Mean Squared Error:", mean_squared_error(y_test_1, y_pred_1))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_1, y_pred_1)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_1, y_pred_1))
    print("R^2 Score:", r2_score(y_test_1, y_pred_1))
    print("-" * 30)
Linear Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [381.10274737 614.26813431 568.9890942  829.42128006 851.77095679
 887.58840823 945.67272402 806.36769811]
Mean Squared Error: 39916.05861992978
Root Mean Squared Error: 199.79003633797603
Mean Absolute Error: 179.99312732449633
R^2 Score: -8.506841071789347
------------------------------
Gradient Boosting
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [456.18455633 431.57089064 199.99754744 431.57089064 428.6522523
 430.57524259 421.29913656 301.86127223]
Mean Squared Error: 50555.82880401083
Root Mean Squared Error: 224.84623368873855
Mean Absolute Error: 197.25773423525322
R^2 Score: -11.040924036832337
------------------------------
Support Vector Regression
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [135.05359887 132.32778516 132.30816009 132.2267862  132.22674094
 132.22672353 132.22672088 132.22672107]
Mean Squared Error: 208901.38556081773
Root Mean Squared Error: 457.0573110243591
Mean Absolute Error: 452.36880323404046
R^2 Score: -48.75421774763337
------------------------------
XGBoost
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [457.23843 448.26352 182.2107  448.26352 448.26352 448.26517 421.3505
 283.40018]
Mean Squared Error: 51314.26362951942
Root Mean Squared Error: 226.5265186010667
Mean Absolute Error: 192.814767365053
R^2 Score: -11.221561093663905
------------------------------
Neural Network
Actual Values:    [498.37156642 509.24512787 525.55547005 572.67423633 612.16901125
 625.         653.9194169  682.83883379]
Predicted Values: [312.34901483 566.74702553 457.331885   790.98686945 805.79393293
 837.45998432 872.51691429 679.21682571]
Mean Squared Error: 27581.695164641125
Root Mean Squared Error: 166.07737704046608
Mean Absolute Error: 144.79563486210975
R^2 Score: -5.569155409794445
------------------------------

Model Performance Summary Result 2¶

Model Mean Squared Error (MSE) Root Mean Squared Error (RMSE) Mean Absolute Error (MAE) R² Score
Linear Regression 39916.06 199.79 179.99 -8.5068
Gradient Boosting 50555.83 224.85 197.26 -11.0409
Support Vector Regression 208901.39 457.06 452.37 -48.7542
XGBoost 51314.26 226.53 192.81 -11.2216
Neural Network 27581.70 166.08 144.80 -5.5692

Insights:¶

  • Linear Regression:

    • The model demonstrates significant deviations between actual and predicted values.
    • The high MSE and RMSE indicate substantial errors, and the negative R² Score suggests a poor fit.
  • Gradient Boosting:

    • This model shows a slightly better fit than Linear Regression but still suffers from high errors and a negative R² Score.
  • Support Vector Regression:

    • The model performs poorly, with very high MSE and RMSE, and nearly constant predicted values. The extremely negative R² Score reflects its unsuitability for this dataset.
  • XGBoost:

    • While slightly better than Gradient Boosting, XGBoost also exhibits high errors and a negative R² Score, indicating poor predictive performance.
  • Neural Network:

    • The Neural Network model performs the best among the tested models, with the lowest MSE, RMSE, and MAE. However, the negative R² Score still indicates a poor overall fit.

Summary:¶

  • All models show poor performance with negative R² scores, indicating that none of the models effectively capture the underlying patterns in the data.
  • Neural Network performs relatively better compared to other models but still does not provide an adequate fit.
  • Support Vector Regression and XGBoost show the highest errors and are not suitable for this dataset.
  • Further analysis and optimization, including feature engineering and model tuning, are needed to improve predictive performance.

Prediction for next 5 years (Combination 2: [['Yield', 'Area_Harvested']])¶

In [25]:
# filtering the data required for prediction
pred_data_2 = merged_data.iloc[24:][['Year','Yield','Area_Harvested']]
print(pred_data_2)
    Year  Yield  Area_Harvested
24  2024    8.1     6800.000000
25  2025    7.6     7200.000000
26  2026    7.7     7132.197788
27  2027    7.7     7221.251348
28  2028    7.8     7244.102280
In [26]:
# Scaling the data
X_pred_scaled_2 = scaler.transform((pred_data_2[['Yield','Area_Harvested']]))
In [27]:
# Access the trained Neural Network model
nn_model = models["Neural Network"]
In [28]:
# Predict new values using the trained model
y_predict_2 = nn_model.predict(X_pred_scaled_2)
pred_data_2['Predicted Market Value (USD)'] = y_predict_2
print(pred_data_2)
    Year  Yield  Area_Harvested  Predicted Market Value (USD)
24  2024    8.1     6800.000000                    899.215950
25  2025    7.6     7200.000000                    936.340831
26  2026    7.7     7132.197788                    931.194935
27  2027    7.7     7221.251348                    947.834032
28  2028    7.8     7244.102280                    959.626104

Interpretation:¶

Between 2024 and 2028, the trained Neural Network model predicts a steady increase in market value, reaching USD 959.63 million by 2028. Yield remains relatively stable, fluctuating between 7.6 and 8.1 tons/ha, while the area harvested grows from 6800 to 7244 hectares. This suggests that despite minor variations in yield and harvested area, the market value is expected to rise consistently over the period.

Regression using random sampling¶

In response to poor ex post forecasting results using regression, random sampling was applied to enhance model validation by ensuring that training and test data are representative of the entire dataset.

Combination 1: [['Yield', 'CPI']]¶

In [29]:
import joblib

np.random.seed(0)

# Prepare features and target
X_2 = filtered_data[['Yield', 'CPI_New']]
y_2 = filtered_data['Market_Value_New']

# Split data into training and test sets
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, test_size=0.2, random_state=0)

# Standardize the features
scaler = StandardScaler()
X_train_scaled_2 = scaler.fit_transform(X_train_2)
X_test_scaled_2 = scaler.transform(X_test_2)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}

# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_2, y_train_2)
    y_pred_2 = model.predict(X_test_scaled_2)
    print(f"{name}")
    print("Actual Values:   ", y_test_2.values)
    print("Predicted Values:", y_pred_2)
    print("Mean Squared Error:", mean_squared_error(y_test_2, y_pred_2))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_2, y_pred_2)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_2, y_pred_2))
    print("R^2 Score:", r2_score(y_test_2, y_pred_2))
    print("-" * 30)
Linear Regression
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [211.79944641 276.5885912  783.59804879 347.52637127 535.05353473]
Mean Squared Error: 15000.543997917877
Root Mean Squared Error: 122.47670798122343
Mean Absolute Error: 106.86371592745158
R^2 Score: 0.6663162535794338
------------------------------
Gradient Boosting
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [150.7034757  401.49489273 662.06813274 516.9090123  498.81220423]
Mean Squared Error: 15410.212431117656
Root Mean Squared Error: 124.13787669812004
Mean Absolute Error: 88.14959712064571
R^2 Score: 0.6572032709036508
------------------------------
Support Vector Regression
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [168.97183059 173.1619798  173.89771811 173.14459834 176.25000889]
Mean Squared Error: 111271.34138684312
Root Mean Squared Error: 333.5735921604753
Mean Absolute Error: 266.4692620067271
R^2 Score: -1.4752061037491244
------------------------------
XGBoost
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [163.1888  330.12357 625.01636 514.72876 498.37286]
Mean Squared Error: 9347.700745269864
Root Mean Squared Error: 96.6835081348927
Mean Absolute Error: 76.05305477570961
R^2 Score: 0.7920624874982588
------------------------------
Neural Network
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [ 90.36357526 168.13778621 414.52421336 173.2148767  308.16612539]
Mean Squared Error: 59274.97219213003
Root Mean Squared Error: 243.46451937013333
Mean Absolute Error: 206.66349632978168
R^2 Score: -0.3185584997977562
------------------------------

Model Performance Summary Result 3¶

The results of the model evaluations with random sampling are as follows:

Model Mean Squared Error (MSE) Root Mean Squared Error (RMSE) Mean Absolute Error (MAE) R² Score
Linear Regression 15000.54 122.48 106.86 0.6663
Gradient Boosting 15410.21 124.14 88.15 0.6572
Support Vector Regression 111271.34 333.57 266.47 -1.4752
XGBoost 9347.70 96.68 76.05 0.7921
Neural Network 59274.97 243.46 206.66 -0.3186

Insights:¶

  • Linear Regression:

    • The model demonstrates moderate deviations between actual and predicted values.
    • The MSE and RMSE are relatively lower compared to other models, and the R² Score is positive (0.6663), indicating a decent fit but with room for improvement.
  • Gradient Boosting:

    • This model shows higher errors compared to Linear Regression, with a positive R² Score (0.6572) indicating some predictive capability but still not optimal.
  • Support Vector Regression:

    • The model performs poorly, with very high MSE and RMSE, and nearly constant predicted values. The negative R² Score (-1.4752) reflects its unsuitability for this dataset.
  • XGBoost:

    • XGBoost shows a better fit compared to Gradient Boosting with a positive R² Score (0.7921) and lower errors, suggesting it captures some underlying patterns in the data.
  • Neural Network:

    • The Neural Network model performs poorly with high MSE, RMSE, and MAE. The negative R² Score (-0.3186) indicates a poor overall fit, failing to capture the data patterns effectively.

Summary:¶

  • XGBoost performs relatively better compared to other models with the highest positive R² Score and the lowest MSE and RMSE.
  • Linear Regression also shows decent performance with a positive R² Score but higher errors compared to XGBoost.
  • Gradient Boosting has moderate performance with a positive R² but higher errors.
  • Support Vector Regression and Neural Network show the highest errors and negative R² scores, indicating they are not suitable for this dataset.

Prediction for next 5 Years (Combination 1: [['Yield', 'CPI']], random sampling)¶

In [30]:
# filtering the data required for prediction
pred_data_3 = merged_data.iloc[24:][['Year','Yield','CPI']]
print(pred_data_3)
    Year  Yield    CPI
24  2024    8.1  272.2
25  2025    7.6   70.9
26  2026    7.7   21.2
27  2027    7.7   11.8
28  2028    7.8    5.0
In [31]:
# Scaling the data
X_pred_scaled_3 = scaler.transform((pred_data_3)[['Yield','CPI']])
In [32]:
# Access the trained XGBoost model
xgb_model = models["XGBoost"]
In [33]:
# Predict new values using the trained model
y_predict_3 = xgb_model.predict(X_pred_scaled_3)
pred_data_3['Market_Value'] = y_predict_3
print(pred_data_3)
    Year  Yield    CPI  Market_Value
24  2024    8.1  272.2    572.647339
25  2025    7.6   70.9    624.999634
26  2026    7.7   21.2    514.728760
27  2027    7.7   11.8    127.414009
28  2028    7.8    5.0    124.699387
In [34]:
# Select and filter the relevant columns from merged_data till year 2023
merged_data_filtered = merged_data[merged_data['Year'] <= 2023][['Year', 'Yield', 'Market_Value', 'CPI']]

# Select and rename the relevant columns from pred_data_3
pred_data_filtered = pred_data_3[['Year', 'Yield', 'Market_Value', 'CPI']]

# Concatenate the DataFrames
combined_data = pd.concat([merged_data_filtered, pred_data_filtered], ignore_index=True)
print(combined_data)
    Year  Yield  Market_Value    CPI
0   2000    5.5    103.298834   -0.9
1   2001    5.5    114.172395   -1.1
2   2002    6.1     90.613012   25.9
3   2003    6.3    108.735614   13.4
4   2004    6.4    117.796916    4.4
5   2005    7.4    132.294998    9.6
6   2006    6.5     94.237533   10.9
7   2007    8.0    117.796916    8.8
8   2008    6.5    168.540202    8.6
9   2009    6.2    126.858217    6.3
10  2010    8.3    154.042121   10.5
11  2011    6.7    199.348627    9.8
12  2012    5.8    302.647460   10.0
13  2013    6.8    537.428899   10.6
14  2014    7.6    540.053552   21.4
15  2015    8.5    457.233259   16.4
16  2016    8.0    498.371566   37.5
17  2017    8.4    509.245128   25.4
18  2018    6.2    525.555470   34.2
19  2019    8.4    572.674236   52.8
20  2020    8.1    612.169011   40.5
21  2021    7.9    625.000000   47.1
22  2022    7.0    900.000000   73.1
23  2023    5.3    682.838834  134.0
24  2024    8.1    572.647339  272.2
25  2025    7.6    624.999634   70.9
26  2026    7.7    514.728760   21.2
27  2027    7.7    127.414009   11.8
28  2028    7.8    124.699387    5.0
In [58]:
# Given error values
mse = 9347.700745269864
rmse = 96.6835081348927
mae = 76.05305477570961
r2 = 0.7920624874982588

# Calculate the standard error
std_error = np.sqrt(mse)

# Calculate the confidence intervals (95%)
z_value = 1.96  # for 95% confidence
combined_data['Lower_Bound'] = combined_data['Market_Value'] - z_value * std_error
combined_data['Upper_Bound'] = combined_data['Market_Value'] + z_value * std_error

# Display the prediction intervals for the predicted years (2024 onwards)
predicted_years = combined_data[combined_data['Year'] >= 2024]
print(predicted_years[['Year', 'Market_Value', 'Lower_Bound', 'Upper_Bound']])

# Plot the actual and predicted market values
plt.figure(figsize=(10, 6))

# Plot actual values
plt.plot(combined_data['Year'], combined_data['Market_Value'], label='Actual Market Value', marker='o', color='blue')

# Highlight the predicted values from 2024 onwards in orange
predicted_years = combined_data[combined_data['Year'] >= 2024]
plt.plot(predicted_years['Year'], predicted_years['Market_Value'], label='Predicted Market Value', marker='x', linestyle='--', color='orange')

# Plot the prediction intervals
plt.scatter(predicted_years['Year'], predicted_years['Lower_Bound'], color='blue', marker='*', label='Lower PI')
plt.scatter(predicted_years['Year'], predicted_years['Upper_Bound'], color='blue', marker='*', label='Upper PI')
plt.fill_between(predicted_years['Year'], predicted_years['Lower_Bound'], predicted_years['Upper_Bound'], color='orange', alpha=0.3, label='Prediction Interval')

plt.xlabel('Year')
plt.ylabel('Market Value (USD)')
plt.title('Actual vs Predicted Market Value with Prediction Intervals')
plt.legend()
plt.grid(True)
plt.show()
    Year  Market_Value  Lower_Bound  Upper_Bound
24  2024    572.647339   383.147663   762.147015
25  2025    624.999634   435.499958   814.499310
26  2026    514.728760   325.229084   704.228436
27  2027    127.414009   -62.085667   316.913685
28  2028    124.699387   -64.800289   314.199063

Interpretation:¶

Between 2024 and 2028, the XGBoost model predicts a fluctuating market value, peaking at USD 625 million in 2025 before declining sharply to USD125 million by 2028. Yield remains stable around 7.6 to 8.1 tons/ha, while CPI significantly decreases from 272.2 to 5.0, indicating a deflationary trend. Despite consistent yields, the declining CPI appears to correlate with a substantial drop in market value towards the end of the period.

The predicted market values for the years 2024 to 2028 exhibit significant variability, as indicated by their corresponding prediction intervals. In 2024, the market value is predicted to be approximately USD 572.65, with a 95% confidence interval ranging from USD 383.15 to USD 762.15. Similarly, for 2025, the predicted value is around USD 625.00, with an interval from USD 435.50 to USD 814.50. However, in subsequent years, the intervals widen and include negative values, reflecting increased uncertainty. For instance, the 2027 prediction of USD 127.41 spans from USD -62.09 to USD 316.91, and in 2028, the value of USD 124.70 ranges from USD -64.80 to USD 314.20. These intervals highlight the model's diminishing confidence in long-term predictions, underscoring the potential volatility and uncertainty in future market values.

Combination 2: [['Yield', 'Area_Harvested']]¶

In [36]:
np.random.seed(0)

# Prepare features and target
X_3 = filtered_data[['Yield', 'Area_Harvested']]
y_3 = filtered_data['Market_Value_New']

# Split data into training and test sets
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y_3, test_size=0.2, random_state=0)

# Standardize the features
scaler = StandardScaler()
X_train_scaled_3 = scaler.fit_transform(X_train_3)
X_test_scaled_3 = scaler.transform(X_test_3)

# Initialize models
models = {
    "Linear Regression": LinearRegression(),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Support Vector Regression": SVR(),
    "XGBoost": xgb.XGBRegressor(random_state=0),
    "Neural Network": MLPRegressor(random_state=0, max_iter=1000)
}

# Fit models and evaluate
for name, model in models.items():
    model.fit(X_train_scaled_3, y_train_3)
    y_pred_3 = model.predict(X_test_scaled_3)
    print(f"{name}")
    print("Actual Values:   ", y_test_3.values)
    print("Predicted Values:", y_pred_3)
    print("Mean Squared Error:", mean_squared_error(y_test_3, y_pred_3))
    print("Root Mean Squared Error:", np.sqrt(mean_squared_error(y_test_3, y_pred_3)))
    print("Mean Absolute Error:", mean_absolute_error(y_test_3, y_pred_3))
    print("R^2 Score:", r2_score(y_test_3, y_pred_3))
    print("-" * 30)
Linear Regression
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [298.40307779 257.07339817 747.39466084 284.55982473 681.77063886]
Mean Squared Error: 19857.25618819813
Root Mean Squared Error: 140.91577693146402
Mean Absolute Error: 124.13126553474847
R^2 Score: 0.5582797770913748
------------------------------
Gradient Boosting
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [436.37243614 115.08548697 659.84278816 177.17403284 573.25046317]
Mean Squared Error: 38185.83814483376
Root Mean Squared Error: 195.41197032125172
Mean Absolute Error: 136.74037632243926
R^2 Score: 0.15056457058183204
------------------------------
Support Vector Regression
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [170.07935442 171.90944137 173.45048622 171.17437544 174.69577081]
Mean Squared Error: 111896.20155355385
Root Mean Squared Error: 334.5088960753568
Mean Absolute Error: 266.7915881274097
R^2 Score: -1.489105978410067
------------------------------
XGBoost
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [437.5803  118.23085 620.2428  119.82225 572.67896]
Mean Squared Error: 47464.939887327215
Root Mean Squared Error: 217.86449891464008
Mean Absolute Error: 153.48818246284506
R^2 Score: -0.055846972445047216
------------------------------
Neural Network
Actual Values:    [199.34862657 154.04212053 653.9194169  540.05355198 612.16901125]
Predicted Values: [129.7573991  173.93873192 447.76964348 152.29159375 441.44798985]
Mean Squared Error: 45448.30931272621
Root Mean Squared Error: 213.18609080501994
Mean Absolute Error: 170.8241183807458
R^2 Score: -0.010987476324604728
------------------------------

Model Performance Summary Result 4¶

The results of the model evaluations with random sampling are as follows:

Model Mean Squared Error (MSE) Root Mean Squared Error (RMSE) Mean Absolute Error (MAE) R² Score
Linear Regression 19857.26 140.92 124.13 0.5583
Gradient Boosting 38185.84 195.41 136.74 0.1506
Support Vector Regression 111896.20 334.51 266.79 -1.4891
XGBoost 47464.94 217.86 153.49 -0.0558
Neural Network 45448.31 213.19 170.82 -0.0110

Insights:¶

  • Linear Regression:

    • The model demonstrates moderate deviations between actual and predicted values.
    • The relatively low MSE and RMSE, and the highest positive R² Score (0.5583) among all models indicate that this model fits the data best compared to others.
  • Gradient Boosting:

    • This model shows higher errors compared to Linear Regression, with a positive R² Score (0.1506) indicating some predictive capability but still not optimal.
  • Support Vector Regression:

    • The model performs poorly, with very high MSE and RMSE, and nearly constant predicted values. The negative R² Score (-1.4891) reflects its unsuitability for this dataset.
  • XGBoost:

    • XGBoost shows a better fit compared to Gradient Boosting but still has high errors and a low negative R² Score (-0.0558), suggesting it captures some underlying patterns in the data but not effectively.
  • Neural Network:

    • The Neural Network model performs poorly with high MSE, RMSE, and MAE. The negative R² Score (-0.0110) indicates a poor overall fit, failing to capture the data patterns effectively.

Summary:¶

  • Linear Regression performs relatively better compared to other models with the highest positive R² Score and the lowest MSE and RMSE, indicating a better fit.
  • Gradient Boosting and XGBoost show moderate performance with positive R² scores but higher errors compared to Linear Regression.
  • Support Vector Regression and Neural Network show the highest errors and negative R² scores, indicating they are not suitable for this dataset.
  • Further analysis and optimization, including feature engineering and model tuning, are needed to improve predictive performance.
In [41]:
# filtering the data required for prediction
pred_data_4 = merged_data.iloc[24:][['Year','Yield','Area_Harvested']]
print(pred_data_4)
    Year  Yield  Area_Harvested
24  2024    8.1     6800.000000
25  2025    7.6     7200.000000
26  2026    7.7     7132.197788
27  2027    7.7     7221.251348
28  2028    7.8     7244.102280
In [42]:
# Scaling the data
X_pred_scaled_4 = scaler.transform((pred_data_4[['Yield','Area_Harvested']]))
In [43]:
# Access the trained Linear regresssion model
Lr_model = models["Linear Regression"]
In [44]:
# Predict new values using the trained model
y_predict_4 = nn_model.predict(X_pred_scaled_4)
pred_data_4['Market_Value'] = y_predict_4
print(pred_data_4)
    Year  Yield  Area_Harvested  Market_Value
24  2024    8.1     6800.000000    486.162202
25  2025    7.6     7200.000000    492.329445
26  2026    7.7     7132.197788    492.188767
27  2027    7.7     7221.251348    500.166677
28  2028    7.8     7244.102280    508.146956

Interpretation:¶

Using the Linear Regression model, the predicted market value shows a steady increase from $486.16 million in 2024 to $508.15 million in 2028. Yield remains stable between 7.6 and 8.1 tons/ha, while the area harvested increases slightly from 6800 to 7244 hectares. This indicates that the model forecasts a gradual rise in market value despite relatively consistent yield and harvested area.

Summarising all above results:¶

  1. Result 1: XGBoost performs the best with the lowest MSE, RMSE, and MAE, although it has a negative R² score.
  2. Result 2: Neural Network performs the best with the lowest MSE, RMSE, and MAE, but again, with a negative R² score.
  3. Result 3: XGBoost performs the best with the lowest MSE, RMSE, and MAE, and a positive R² score of 0.7921.
  4. Result 4: Linear Regression performs the best with the lowest MSE, RMSE, and MAE, and a positive R² score of 0.5583.

Given these results, Result 3 with XGBoost stands out as the best choice for forecasting the future 5 years because it has the lowest errors and the highest positive R² score, indicating a good fit and better predictive capability.

In Result 3 we used Random sampling and Combination 1 for conduct regression analysis

Summary for Best Model:¶

Result 3 with XGBoost:

  • Mean Squared Error (MSE): 9347.70
  • Root Mean Squared Error (RMSE): 96.68
  • Mean Absolute Error (MAE): 76.05
  • R² Score: 0.7921

This model and result combination demonstrates the best performance and should be used for making forecasts for the next 5 years.

Note: The visualization of the results was performed exclusively for this model due to its relative best performance.

Forecasting based on Market Value¶

Autocorrelation: Autocorrelation, also known as serial correlation, measures the correlation of a time series with its own past values. It is a crucial aspect in time series analysis and forecasting because it helps identify patterns, trends, and seasonality in the data.

Importance in Forecasting:

  1. Pattern Identification: Autocorrelation helps in detecting repeating patterns or cycles within the data, which are critical for building accurate forecasting models.
  2. Model Selection: The presence and strength of autocorrelation can guide the selection of appropriate forecasting models (e.g., ARIMA models leverage autocorrelation for better forecasts).
  3. Parameter Estimation: In models like ARIMA, the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots are used to determine the parameters (p) and (q).
In [45]:
from statsmodels.graphics.tsaplots import plot_acf
from scipy.stats import boxcox

# Identify and replace the outlier (assuming the spike is at index 22 with value 900)
outlier_index = 22
neighbors_indices = [outlier_index - 1, outlier_index + 1]

# Calculate the median of the neighboring points
neighbors_values = filtered_data.loc[neighbors_indices, 'Market_Value_New']
median_value = neighbors_values.median()

# Replace the outlier with the median value
filtered_data.at[outlier_index, 'Market_Value_New'] = median_value

# Make the data stationary using Box-Cox transformation and differencing
filtered_data['Market_Value_boxcox'], lam = boxcox(filtered_data['Market_Value_New'])
filtered_data['Market_Value_stationary'] = filtered_data['Market_Value_boxcox'].diff().dropna()

# Plot autocorrelation
plt.rc("figure", figsize=(11, 5))
plot_acf(filtered_data['Market_Value_stationary'].dropna())
plt.xlabel('Lags', fontsize=18)
plt.ylabel('Correlation', fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)
plt.title('Autocorrelation Plot', fontsize=20)
plt.tight_layout()
plt.show()

Interpreting the Graph: The provided graph appears to be an Autocorrelation Function (ACF) plot, which shows the correlation of the time series with its lags.

  • Lag: The horizontal axis represents the lag, which is the number of time steps back.
  • ACF: The vertical axis represents the autocorrelation coefficient at each lag.
  • Blue Bars: The blue bars indicate the autocorrelation coefficient for each lag. If the bars extend beyond the shaded area (confidence interval), it indicates significant autocorrelation at that lag.
  • Shaded Area: The shaded area represents the confidence interval. If the autocorrelation values are within this area, they are not significantly different from zero at a given confidence level.

Graph Interpretation:

  1. Significant Lag: The first lag (lag 1) shows a significant positive autocorrelation, indicating that the value of the series at time (t) is positively correlated with its value at time (t-1).
  2. Subsequent Lags: The autocorrelations at subsequent lags mostly fall within the confidence interval, suggesting that there is no significant autocorrelation at those lags.
  3. Decay Pattern: The decay of autocorrelation values to within the confidence interval suggests that the influence of past values diminishes quickly.

Conclusion:

  • The significant autocorrelation at lag 1 suggests that a simple model like AR(1) (AutoRegressive model of order 1) might be appropriate.
  • The quick decay of autocorrelation values indicates that the series does not have a long-term dependency structure, simplifying the modeling process.

Understanding and interpreting autocorrelation is essential for selecting the correct forecasting model and improving the accuracy of predictions.

Naive Forecast Model¶

Naive Model:

The naive forecast model is a simple time series forecasting method that assumes the value at the next time step is equal to the value at the current time step plus the change observed in the most recent time step. It is particularly useful for data with a strong trend or seasonal patterns.

Why Choose the Naive Model?

  1. Simplicity: The naive model is easy to implement and understand.
  2. Baseline Comparison: It serves as a benchmark to compare the performance of more complex models.
  3. Applicability: Useful for datasets with clear trends or seasonality, where the most recent observation is a good predictor of the next one.

Mathematical Formula

The naive forecast model can be expressed mathematically as follows:

\[ \hat{y}_{t+1} = y_t + (y_t - y_{t-1}) \]

Where:

  • (\hat{y}_{t+1})
    is the forecasted value for the next time step.
  • (y_t)
    is the value at the current time step.
  • (y_{t-1})
    is the value at the previous time step.

This formula reflects the idea that the forecast for the next period is the current value plus the most recent change observed in the time series.

This function iterates over the dataset, applying the naive forecasting formula to generate forecasted values. The first forecast is undefined because it requires a previous time step to calculate the change.

In [46]:
# Implement the naive model
def naive_forecast(df, column_name):
    forecasted_values = [None]  # The first forecast is undefined
    for t in range(1, len(df)):
        y_t = df[column_name].iloc[t]
        y_t_minus_1 = df[column_name].iloc[t - 1]
        y_hat_t_plus_1 = y_t + (y_t - y_t_minus_1)
        forecasted_values.append(y_hat_t_plus_1)
    return forecasted_values

# Apply the naive forecast
filtered_data['Market_Value_Forecast'] = naive_forecast(filtered_data, 'Market_Value_New')

# Keep the first forecasted value the same as the actual value for the first year
filtered_data.loc[0, 'Market_Value_Forecast'] = filtered_data.loc[0, 'Market_Value_New']

# Calculate evaluation metrics for the period where predictions are available
filtered_data = filtered_data.dropna(subset=['Market_Value_Forecast'])
mae = mean_absolute_error(filtered_data['Market_Value_New'][1:], filtered_data['Market_Value_Forecast'][1:])
mse = mean_squared_error(filtered_data['Market_Value_New'][1:], filtered_data['Market_Value_Forecast'][1:])
rmse = np.sqrt(mse)

print(f'Mean Absolute Error (MAE): {mae}')
print(f'Mean Squared Error (MSE): {mse}')
print(f'Root Mean Squared Error (RMSE): {rmse}')

# Extend the forecast for the next 5 years
last_value = filtered_data['Market_Value_New'].iloc[-1]
last_forecasted_value = filtered_data['Market_Value_Forecast'].iloc[-1]
forecasted_values = [last_forecasted_value]

for i in range(5):
    next_forecast = forecasted_values[-1] + (forecasted_values[-1] - last_value)
    forecasted_values.append(next_forecast)
    last_value = forecasted_values[-2]

# Create a DataFrame for the extended forecast
forecast_years = range(filtered_data['Year'].max() + 1, filtered_data['Year'].max() + 6)
forecast_df = pd.DataFrame({'Year': forecast_years, 'Market_Value_Forecast': forecasted_values[1:]})

# Calculate the confidence interval
confidence_interval = 1.96 * rmse
forecast_df['Lower_CI'] = forecast_df['Market_Value_Forecast'] - confidence_interval
forecast_df['Upper_CI'] = forecast_df['Market_Value_Forecast'] + confidence_interval

# Output the forecasted values
print(forecast_df)

# Plot the actual and forecasted values along with the confidence interval
plt.figure(figsize=(10, 6))
plt.plot(filtered_data['Year'], filtered_data['Market_Value_New'], label='Actual Market Value', marker='o')
plt.plot(filtered_data['Year'], filtered_data['Market_Value_Forecast'], label='Forecasted Market Value', marker='x', linestyle='--')
plt.plot(forecast_df['Year'], forecast_df['Market_Value_Forecast'], label='Extended Forecast', marker='x', linestyle='--')
plt.scatter(forecast_df['Year'], forecast_df['Lower_CI'], color='blue', marker='*', label='Lower PI')
plt.scatter(forecast_df['Year'], forecast_df['Upper_CI'], color='blue', marker='*', label='Upper PI')
plt.fill_between(forecast_df['Year'], forecast_df['Lower_CI'], forecast_df['Upper_CI'], color='gray', alpha=0.2, label='95% Prediction Interval')
plt.xlabel('Year')
plt.ylabel('Market Value')
plt.title('Actual vs Forecasted Market Value with Confidence Interval')
plt.legend()
plt.grid(True)
plt.show()
Mean Absolute Error (MAE): 41.3816632015416
Mean Squared Error (MSE): 2774.171766262752
Root Mean Squared Error (RMSE): 52.670406930863486
   Year  Market_Value_Forecast    Lower_CI    Upper_CI
0  2024             740.677668  637.443670  843.911665
1  2025             769.597084  666.363087  872.831082
2  2026             798.516501  695.282504  901.750499
3  2027             827.435918  724.201921  930.669916
4  2028             856.355335  753.121338  959.589333

Forecast Results: The naive forecast model predicts the following market values for the next five years, with associated 95% confidence intervals:

Year Market_Value_Forecast Lower_PI Upper_PI
2024 740.68 637.44 843.91
2025 769.60 666.36 872.83
2026 798.52 695.28 901.75
2027 827.44 724.20 930.67
2028 856.36 753.12 959.59

Model Evaluation Metrics for Naive Forecast Model:

  • Mean Absolute Error (MAE): 41.38
  • Mean Squared Error (MSE): 2774.17
  • Root Mean Squared Error (RMSE): 52.67

These metrics indicate the average magnitude of errors in the forecast. MAE provides the average error in absolute terms, MSE emphasizes larger errors by squaring them, and RMSE offers a measure of the standard deviation of the residuals.

Interpretation: The forecasted values show a steady increase in market value over the next five years, with the confidence intervals widening over time, reflecting increasing uncertainty in long-term predictions. The naive forecast model relies heavily on the most recent observations, assuming the same rate of change continues into the future.

Conclusion: The model's performance metrics suggest moderate accuracy, with the errors in forecasted values being relatively consistent. The increasing trend in forecasted market values aligns with historical data patterns, but the widening confidence intervals indicate growing uncertainty. This suggests the need for cautious interpretation of long-term forecasts and consideration of additional models for more robust predictions.

In [47]:
from statsmodels.tsa.holtwinters import Holt
import plotly.graph_objects as go

# Split train and test
train = filtered_data.iloc[:-int(len(filtered_data) * 0.2)]
test = filtered_data.iloc[-int(len(filtered_data) * 0.2):]

def plot_func(forecast1: list[float], title: str) -> None:
    """Function to plot the forecasts."""
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train['Year'], y=train['Market_Value_New'], name='Train'))
    fig.add_trace(go.Scatter(x=test['Year'], y=test['Market_Value_New'], name='Test'))
    fig.add_trace(go.Scatter(x=test['Year'], y=forecast1, name="Holt's Linear"))
    fig.update_layout(template="simple_white", font=dict(size=18), title_text=title,
                      width=700, title_x=0.5, height=400, xaxis_title='Year',
                      yaxis_title='Market Value (USD)')
    return fig.show()

# Fit Holt's model and get forecasts
model_holt = Holt(train['Market_Value_New'], damped_trend=True).fit(optimized=True)
forecasts_holt = model_holt.forecast(len(test))



# Calculate and print optimized parameters
alpha_holt = model_holt.params['smoothing_level']
beta_holt = model_holt.params.get('smoothing_trend', 'Not available')


print(f'Holt - Optimized alpha: {alpha_holt}')
print(f'Holt - Optimized beta: {beta_holt}')

# Plot the forecasts
plot_func(forecasts_holt, "Holt's Exponential Smoothing")
Holt - Optimized alpha: 0.9999999850988388
Holt - Optimized beta: 0.0

Holt's Method¶

Holt's Linear Trend Model: Holt's method extends simple exponential smoothing by including a trend component and hence called double exponential smoothening, using two parameters:

  • Alpha (α): Smoothing parameter for the level.
  • Beta (β): Smoothing parameter for the trend.

Why Holt's Method:

  • Trend Handling: Suitable for data with a linear trend.
  • Simplicity: Easy to implement and understand.
  • Flexibility: Accounts for both level and trend, enhancing forecast accuracy.

But we discard using this method:

  • Zero Beta Value: Optimized beta is zero, reducing it to simple exponential smoothing.
  • Overfitting Concerns: High alpha value may cause overfitting, leading to unstable forecasts if the recent trend changes.
In [48]:
from statsmodels.tsa.ar_model import AutoReg

# Fit an AR(1) model to the data
model_ar1 = AutoReg(filtered_data['Market_Value'], lags=1).fit()

# Print the summary of the AR(1) model
print(model_ar1.summary())

# Forecast the next 5 years
forecast_periods = 5
forecast_years = np.arange(filtered_data['Year'].max() + 1, filtered_data['Year'].max() + 6)
forecasts_ar1 = model_ar1.predict(start=len(filtered_data), end=len(filtered_data) + forecast_periods - 1, dynamic=False)

# Calculate the confidence intervals (assuming normal distribution of residuals)
forecast_std = np.std(model_ar1.resid)
confidence_interval = 1.96 * forecast_std
lower_ci = forecasts_ar1 - confidence_interval
upper_ci = forecasts_ar1 + confidence_interval

# Create a DataFrame for the forecast and prediction intervals
forecast_df = pd.DataFrame({
    'Year': forecast_years,
    'Forecast': forecasts_ar1,
    'Lower_CI': lower_ci,
    'Upper_CI': upper_ci
})

# Print the forecasted values and prediction intervals
print(forecast_df)

# Function to plot the forecasts
def plot_forecast_with_intervals(train, forecast_df, title):
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=train['Year'], y=train['Market_Value'], name='Actual'))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Forecast'], name='Forecast'))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Lower_CI'], name='Lower PI', mode='lines', line=dict(dash='dash')))
    fig.add_trace(go.Scatter(x=forecast_df['Year'], y=forecast_df['Upper_CI'], name='Upper PI', mode='lines', line=dict(dash='dash')))
    fig.update_layout(template="simple_white", font=dict(size=18), title_text=title,
                      width=700, title_x=0.5, height=400, xaxis_title='Year',
                      yaxis_title='Market Value (USD)')
    return fig.show()

# Plot the forecast with prediction intervals
plot_forecast_with_intervals(filtered_data, forecast_df, "AR(1) Model Forecast with Prediction Intervals")
                            AutoReg Model Results                             
==============================================================================
Dep. Variable:           Market_Value   No. Observations:                   24
Model:                     AutoReg(1)   Log Likelihood                -136.560
Method:               Conditional MLE   S.D. of innovations             91.695
Date:                Fri, 12 Jul 2024   AIC                            279.121
Time:                        12:05:06   BIC                            282.527
Sample:                             1   HQIC                           279.978
                                   24                                         
===================================================================================
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              43.5676     33.123      1.315      0.188     -21.353     108.488
Market_Value.L1     0.9445      0.082     11.551      0.000       0.784       1.105
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0588           +0.0000j            1.0588            0.0000
-----------------------------------------------------------------------------
    Year    Forecast    Lower_CI    Upper_CI
24  2024  688.485188  508.762559  868.207816
25  2025  693.817973  514.095345  873.540602
26  2026  698.854604  519.131976  878.577233
27  2027  703.611528  523.888900  883.334157
28  2028  708.104278  528.381649  887.826906

Technical Interpretation of AR(1) Model Results¶

Model Summary: The AR(1) model results indicate that the market value at time (t) is primarily influenced by its value at time (t-1), with an added constant term.

Key Statistics:

  • Number of Observations: 24
  • Log Likelihood: -136.560
  • AIC (Akaike Information Criterion): 279.121
  • BIC (Bayesian Information Criterion): 282.527
  • S.D. of innovations (Standard Deviation of Residuals): 91.695

Coefficients:

  • Constant (const): 43.5676
    • Standard Error: 33.123
    • z-value: 1.315
    • P>|z|: 0.188
    • 95% Confidence Interval: [-21.353, 108.488]
  • Lag 1 (Market_Value.L1): 0.9445
    • Standard Error: 0.082
    • z-value: 11.551
    • P>|z|: 0.000
    • 95% Confidence Interval: [0.784, 1.105]

Roots of the Characteristic Polynomial:

  • The root is 1.0588, indicating the model is stable.

Forecast Results:¶

The forecasts and their 95% prediction intervals for the next five years are as follows:

Year Forecast Lower_PI Upper_PI
2024 688.485188 508.762559 868.207816
2025 693.817973 514.095345 873.540602
2026 698.854604 519.131976 878.577233
2027 703.611528 523.888900 883.334157
2028 708.104278 528.381649 887.826906

Interpretation:

  1. Model Coefficients:

    • The constant term (43.5676) is not statistically significant at the 5% level (P>|z| = 0.188), indicating that the constant does not provide substantial predictive power.
    • The lagged market value (Market_Value.L1) coefficient (0.9445) is highly significant (P>|z| = 0.000), suggesting that the market value at time (t-1) strongly influences the market value at time (t).
  2. Forecasts:

    • The model predicts a steady increase in the market value over the next five years.
    • The confidence intervals show the range within which the true market values are expected to lie with 95% certainty. The widening of the intervals reflects increasing uncertainty as the forecast horizon extends further into the future.
  3. Model Performance:

    • The relatively low AIC and BIC values suggest a good model fit.
    • The high standard error of the constant term suggests some variability in the data that is not captured by the model.

Conclusion:¶

The AR(1) model indicates a strong dependence on the previous year's market value, suggesting a continuation of the existing trend. The forecasts show a steady increase in market value with increasing uncertainty over time. This model can be useful for short-term forecasting but should be complemented with other models or analyses for long-term predictions due to increasing uncertainty.

Final Verdict¶

After comparing various machine learning regression models and traditional forecasting methods, it has been determined that machine learning model XGBoost yields better results for our given dataset. This superiority arises because forecasting models primarily consider only the market value data, whereas regression models account for the impact of multiple variables on the market value. Consequently, regression models provide more accurate forecasts by incorporating these additional factors.